Is protein secondary structure primarily determined by local interactions between residues closely
spaced along the amino acid backbone, or by non-local tertiary interactions? To answer this question
we have measured the entropy densities of primary structure and secondary structure sequences,
and the local inter-sequence mutual information density. We find that the important inter-sequence
interactions are short ranged, that correlations between neighboring amino acids are essentially
uninformative, and that only 1/4 of the total information needed to determine the secondary structure
is available from local inter-sequence correlations. Since the remaining information must come from
non-local interactions, this observation supports the view that the majority of most proteins fold via
a cooperative process where secondary and tertiary structure form concurrently. To provide a more
direct comparison to existing secondary structure prediction methods, we construct a simple hidden
Markov model (HMM) of the sequences. This HMM achieves a prediction accuracy comparable to
other single sequence secondary structure prediction algorithms, and can extract almost all of the
inter-sequence mutual information. This suggests that these algorithms are almost optimal, and
that we should not expect a dramatic improvement in prediction accuracy. However, local
correlations between secondary and primary structure are probably of under-appreciated importance in
many tertiary structure prediction methods, such as threading.

INTRODUCTION

The secondary structure of a protein is a summary of
the general conformation and hydrogen bonding pattern
of the amino acid backbone [1]. This structure is frequently
simplified to a sequence (one element per residue) of
helixes (H), extended strands (E) and unstructured loops
(L). It has long been recognized that each residue's
secondary structure is appreciably correlated with the local
amino acid sequence, and that these correlations may
be used to predict the secondary structure, or as a
contribution to threading potentials [5] and other tertiary
structure prediction algorithms. The effectiveness of
local secondary structure prediction, and the utility of
secondary structure potentials, depends upon the extent to
which a protein's structure, particularly the secondary
structure, is determined by local, short-ranged
interactions between residues closely spaced along the backbone,
as opposed to non-local or long-ranged tertiary
interactions.

The strength, organization and relative importance of
local sequence-structure interactions can be determined
with a statistical analysis of the corpus of known protein
structures. We treat the primary and secondary structures
of a protein as random sequences composed from
either the 20 letter amino acid or the 3 letter EHL
(Extended strand / Helix / Other) structure alphabets, as
shown in Fig. 1. These sequences contain substantial
local sequence and inter-sequence correlations which can be
quantified using entropic measures. To ensure accurate
results we employ a large, carefully curated collection of
2,853 protein sequences derived from the Structural
Classification Of Proteins (SCOP) database.

Primary   : YDPEEHHKLSHEAESLPSWISSQAAGNAVMMGAGYFSP
Secondary : LLHHHHHHHHHHHHLLLEEELLHHHHHHHHHHHHLLLLL

FIG. 1: A protein's amino acid sequence is correlated with
the corresponding secondary structure sequence, represented
here by a sequence of helixes (H), extended strands (E) and
unstructured loops (L). For example, alanines (A) are
typically associated with helixes, while glycines (G) are often
located near helix breaks. Also note that secondary structure
is strongly persistent. Helixes, for example, are on average
about 10 residues long.



Sequence Information 

Entropy is a measure of the information needed to
describe a random variable. Specifically, the entropy
$H(X)$ of a discrete random variable $X$, measured in bits,
is defined as

$$H(X) = -E[\log_2 P(X)] = -\sum_{x \in \mathcal{X}} P(x) \log_2 P(x), \tag{1}$$

where $\mathcal{X}$ is the alphabet, the set of allowed states, $x$ is
an element of $\mathcal{X}$, $E$ denotes the expectation, and $P(x)$ is
the probability of state $x$. When considering the entropy
of a collection of variables it is important to take into
account inter-variable correlations. For a statistically
homogeneous random sequence with local correlations the
appropriate information measure is the entropy density
$h_\mu$, the rate at which the entropy of the sequence
increases with length:

$$h_\mu = \lim_{L \to \infty} \frac{H(X^L) - E_h}{L}.$$

Here, $H(X^L)$ is the entropy of sequence fragments, $X^L$,
of length $L$. The non-extensive excess entropy, $E_h$, is
the quantity of information explained away by taking
account of inter-site correlations. The entropy density is
also referred to as the entropy rate or metric entropy.

A convenient measure of correlation between two
discrete random variables, $X$ and $Y$, is the mutual
information $I(X;Y)$, defined as

$$I(X;Y) = H(X) + H(Y) - H(X,Y) = \sum_{x,y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)},$$

where $P(x,y)$ is the joint probability of observing states
$x$ and $y$. If the random variables are independent
($P(x,y) = P(x)P(y)$) then the mutual information
achieves its lower bound of zero. Mutual information
cannot exceed the entropy of either variable, and this
upper bound is reached when the variables are perfectly
correlated ($P(x,y) = P(x) = P(y)$).

The appropriate entropic correlation measure for a pair
of statistically homogeneous random sequences is the
mutual information density, $i_\mu$,

$$i_\mu = \lim_{L \to \infty} \frac{I(X^L; Y^L) - E_i}{L}.$$

Here, $E_i$ is the excess mutual information.

When we consider the correlations between three
random variables, it is often useful to consider $I(X;Y|Z)$,
the conditional mutual information of $X$ and $Y$, given
a third variable, $Z$. This quantity can be conveniently
defined in terms of mutual information:

$$I(X;Y|Z) = I(X;Y) + I(X,Y;Z) - I(X;Z) - I(Y;Z). \tag{6}$$

Conditioning on a third random variable may increase or
decrease the mutual information.
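The definitions above can be checked numerically. The following is a minimal sketch, not the authors' code: it computes entropy, mutual information and conditional mutual information from an explicit joint distribution and confirms that Eq. 6 holds. The toy distribution over three binary variables and all function names are assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def entropy(dist):
    """Shannon entropy (bits) of a distribution {state: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, axes):
    """Sum a joint distribution {(v0, v1, ...): p} down to the chosen axes."""
    out = defaultdict(float)
    for state, p in joint.items():
        out[tuple(state[a] for a in axes)] += p
    return dict(out)


def mutual_information(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B); a and b are tuples of axis indices."""
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))


def conditional_mutual_information(joint, a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(marginal(joint, a + c)) + entropy(marginal(joint, b + c))
            - entropy(marginal(joint, a + b + c)) - entropy(marginal(joint, c)))


# Toy joint distribution over three binary variables (X, Y, Z).
# The weights are arbitrary illustrative numbers, not protein data.
weights = dict(zip(product((0, 1), repeat=3), (1, 2, 3, 4, 5, 6, 7, 8)))
total = sum(weights.values())
joint = {state: w / total for state, w in weights.items()}

X, Y, Z = (0,), (1,), (2,)
lhs = conditional_mutual_information(joint, X, Y, Z)
rhs = (mutual_information(joint, X, Y) + mutual_information(joint, X + Y, Z)
       - mutual_information(joint, X, Z) - mutual_information(joint, Y, Z))
print(f"I(X;Y|Z) = {lhs:.6f} bits, right-hand side of Eq. 6 = {rhs:.6f} bits")
```

In the analysis of protein sequences the same functions would be applied to observed frequencies, for example with axes standing for a residue, its neighbor, and the local secondary structure.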



RESULTS 
Entropy and Correlations 

In Fig. 2 we plot the entropies for secondary structure
sequence blocks up to length 9 ($3^9 = 19683$ states).
Of the half million residues in our data set, about 23%
are assigned to strand, 39% to helix, and 38% to other,
resulting in a relatively large single site secondary
structure entropy of 1.53 bits. (The maximum entropy for
three states is $\log_2 3 \approx 1.58$ bits.) However,
neighboring secondary structure elements are strongly correlated,
resulting in a relatively large nearest neighbor mutual
information, $I(S_i; S_{i+1}) \approx 0.89$ bits. A linear regression
to the asymptotic functional form, $H(S^L) \approx L h_\mu + E_h$
($L > 3$), gives an excess entropy of $E_h = 0.997 \pm 0.004$
bits, and an entropy density of $h_\mu = 0.598 \pm 0.001$ bits per
residue. This entropy density, the amount of information
needed to describe the secondary structure sequence, is
considerably less than the single site entropy (1.53 bits)
due to the strong inter-site correlations that may be
observed in Fig. 1.

TABLE I: Summary of Primary Structure (R) and Secondary
Structure (S) Sequence and Inter-Sequence Information Measures

Measure                   Symbol                                Value (bits)

PRIMARY
residue entropy           $H(R_i)$                              4.179 ± 0.001
neighbor mutual info.     $I(R_i; R_{i+1})$                     0.006 ± 0.002
conditional neighbor MI   $I(R_i; R_{i+1} | S_i S_{i+1})$       0.0159 ± 0.0004
entropy density           $h_\mu(R)$                            4.173 ± 0.003

SECONDARY
residue entropy           $H(S_i)$                              1.533 ± 0.002
neighbor mutual info.     $I(S_i; S_{i+1})$                     0.893 ± 0.003
entropy density           $h_\mu(S)$                            0.598 ± 0.001
excess entropy            $E_h(S)$                              0.997 ± 0.005

INTER-SEQUENCE
monomer mutual info.      $I(R_i; S_i)$                         0.0813 ± 0.0007
dipeptide mutual info.    $I(R_i R_{i+1}; S_i S_{i+1})$         0.208 ± 0.002
mutual info. density      $i_\mu(R;S)$                          0.164 ± 0.003

It is notable that the entropies for short blocks are
almost identical to the asymptotic linear extrapolation
used to estimate entropy density and excess entropy
(Fig. 2). This property is indicative of a sequence with a
simple structure, and suggests that many of the
important statistical features of secondary structure sequences
can be successfully modeled by a low order Markov
chain.
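The block entropy analysis can be illustrated on synthetic data. The sketch below is an illustration under stated assumptions, not the analysis pipeline of this paper: it samples a sequence from a hypothetical first-order Markov chain over the EHL alphabet (the transition probabilities are invented), measures the block entropies $H(S^L)$ with the simple frequency estimator, and fits the linear form $L h_\mu + E_h$ to the blocks with $L > 3$. No bias correction is applied here.

```python
import math
import random
from collections import Counter

STATES = "EHL"
# Hypothetical, strongly persistent transition probabilities (illustration only).
TRANS = {
    "E": {"E": 0.80, "H": 0.02, "L": 0.18},
    "H": {"E": 0.02, "H": 0.90, "L": 0.08},
    "L": {"E": 0.12, "H": 0.10, "L": 0.78},
}


def sample_chain(n, rng):
    """Draw n symbols from the first-order Markov chain defined by TRANS."""
    seq, state = [], "L"
    for _ in range(n):
        seq.append(state)
        state = rng.choices(STATES, weights=[TRANS[state][t] for t in STATES])[0]
    return "".join(seq)


def block_entropy(seq, length):
    """Frequency (plug-in) estimate of H(S^L) in bits from overlapping blocks."""
    counts = Counter(seq[i:i + length] for i in range(len(seq) - length + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def linear_fit(xs, ys):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


rng = random.Random(0)
seq = sample_chain(50_000, rng)
lengths = list(range(1, 8))
entropies = [block_entropy(seq, L) for L in lengths]

# Fit H(S^L) ~ L*h_mu + E_h using only the blocks with L > 3.
fit_lengths = [L for L in lengths if L > 3]
h_mu, excess = linear_fit(fit_lengths, [entropies[L - 1] for L in fit_lengths])
print(f"entropy density ~ {h_mu:.3f} bits/residue, excess entropy ~ {excess:.3f} bits")
```

For a first-order chain the block entropies are exactly linear in $L$, so the fitted slope estimates the entropy density and the intercept estimates the excess entropy, mirroring the behavior described for Fig. 2.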

In contrast to secondary structure, neighboring amino
acids are only weakly correlated. The nearest neighbor
mutual information, $I(R_i; R_{i+1}) \approx 0.006$ bits, is small
relative to the single site entropy of $H(R_i) \approx 4.18$ bits,
which, consequently, is almost identical to the primary
sequence entropy density. Moreover, the mutual
information between neighboring amino acids, conditioned
upon the corresponding secondary structure (Eq. 6),
is also relatively insignificant: $I(R_i; R_{i+1} | S_i S_{i+1}) \approx
0.016$ bits. Neighboring amino acids are approximately
independent, irrespective of the local structure. The
correlations between more distantly separated residues
are also very small.

The strength of the primary to secondary structure
sequence correlations is quantified by the inter-sequence
mutual information density. However, the mutual
information can only be directly calculated for short sequence
blocks due to the large effective alphabet of 60 (= 3 × 20)
symbols. The observed single site mutual information is
$I(S_i; R_i) \approx 0.081$ bits, and the dipeptide mutual
information is $I(R_i R_{i+1}; S_i S_{i+1}) \approx 0.208$ bits, or 0.104 bits
per residue. Fortunately, to a good approximation we
can neglect the correlations between amino acids, since
neighboring residues are almost (conditionally)
independent. For example, the dipeptide mutual information,
$I(R_i R_{i+1}; S_i S_{i+1}) \approx 0.208$ bits, can be approximated by
$I(R_i; S_i S_{i+1}) + I(R_{i+1}; S_i S_{i+1}) \approx 0.198$ bits, an expression
that explicitly ignores amino acid correlations. The
relatively small error of 0.010 bits (less than 5% of the
dipeptide mutual information) is directly related to the mutual
information between neighboring amino acids, since (by
Eq. 6)

$$I(R_i R_{i+1}; S_i S_{i+1}) - I(R_i; S_i S_{i+1}) - I(R_{i+1}; S_i S_{i+1}) = I(R_i; R_{i+1} | S_i S_{i+1}) - I(R_i; R_{i+1}).$$

With the measured values from Table I, the right-hand side is
0.0159 - 0.006 ≈ 0.010 bits, in agreement with the observed error.

FIG. 2: Secondary structure sequences are strongly
correlated, but the correlations have a simple structure. In this
figure we plot the entropy of secondary structure blocks, $H(S^L)$,
as a function of block length, $L$ (points). Bootstrapped
confidence intervals are smaller than the data point symbols. The
linear increase of block entropies is indicative of a simple
sequence, one that can, to a good first order approximation,
be modeled as a low order Markov chain. A linear regression
to the data (solid line) gives an excess entropy of $E_h \approx 1.0$
bits (zero intercept) and a true secondary structure entropy
density of $h_\mu \approx 0.60$ bits per residue. Over half of the single
site entropy is explained away when we look beyond single
site statistics.

It follows that the inter-sequence mutual information
density can be estimated by examining $I_C(R^1; S^L)$, the
mutual information between a block of secondary
structure of length $L$ and the single amino acid located at the
center of that block (see Fig. 3). Empirically, we expect this
mutual information to approach its limiting value
exponentially as the block length increases. A nonlinear regression
to the functional form $a - b \exp(-L/c)$ (using data from
odd block lengths only) gives $c = 3.8 \pm 0.3$ residues for
the characteristic length scale, $b = 0.108 \pm 0.002$ for the
scaling prefactor and $a = 0.164 \pm 0.003$ bits for the central
amino acid to secondary structure mutual information in
the infinite block length limit. This last value is a good
approximation to the inter-sequence mutual information
density, $i_\mu(R;S)$, with a bias, due to neglecting amino
acid correlations, that is probably less than 10%.

FIG. 3: The direct local interactions between primary and
secondary structure are short ranged. Here, the mutual
information, $I_C(R^1; S^L)$, between a block of secondary structure
of length $L$ and the single amino acid located at the center
(odd $L$) or immediately left of center (even $L$) of that block
is plotted against block length (points). Bootstrapped
confidence intervals are smaller than the data point symbols. A
non-linear regression to an empirical exponential functional
form gives a characteristic length scale of about 4 residues,
and a limiting value of $I_C(R^1; S^\infty) \approx 0.164$ bits, which is a
reasonable approximation to the total mutual information density,
$i_\mu(R;S)$.
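The regression to $a - b \exp(-L/c)$ can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the fitting code used for Fig. 3: the data points are synthetic, generated from the reported fit parameters at odd block lengths rather than taken from the measured mutual informations, and the fitting strategy (a one-dimensional grid search over $c$ with a linear least-squares solution for $a$ and $b$) is a choice made here for simplicity.

```python
import math


def fit_saturating_exponential(lengths, values, c_grid):
    """Least-squares fit of values ~ a - b*exp(-L/c).

    For each trial c the model is linear in (a, b), so a and b are solved in
    closed form and only c needs a one-dimensional search.
    """
    best = None
    n = len(lengths)
    for c in c_grid:
        u = [-math.exp(-L / c) for L in lengths]
        mean_u, mean_v = sum(u) / n, sum(values) / n
        b = (sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, values))
             / sum((ui - mean_u) ** 2 for ui in u))
        a = mean_v - b * mean_u
        sse = sum((a + b * ui - vi) ** 2 for ui, vi in zip(u, values))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1:]


# Synthetic points at odd block lengths, generated from the reported fit
# (a = 0.164 bits, b = 0.108, c = 3.8 residues); not the measured data.
odd_lengths = list(range(1, 16, 2))
synthetic = [0.164 - 0.108 * math.exp(-L / 3.8) for L in odd_lengths]

c_grid = [0.5 + 0.01 * i for i in range(951)]  # trial length scales 0.5 .. 10
a, b, c = fit_saturating_exponential(odd_lengths, synthetic, c_grid)
single_site = a - b * math.exp(-1 / c)  # value of the fitted curve at L = 1
print(f"a = {a:.3f} bits, b = {b:.3f}, c = {c:.2f} residues")
print(f"fitted curve at L = 1: {single_site:.3f} bits")
```

As a consistency check, the fitted curve evaluated at $L = 1$ gives about 0.081 bits, which matches the single site mutual information reported in Table I.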

In summary, the direct local interactions are short
ranged, neighboring amino acids are almost independent,
secondary structure sequences are correlated, but
essentially Markovian, and the important inter-sequence
correlations are local, with a characteristic length scale of
about 4 residues. The inherent information content of secondary
structure sequences is 0.60 bits per residue, about 4
times greater than the 0.16 bits per residue of local
mutual information between primary and secondary
structure. These measurements place severe constraints on
any single-sequence prediction algorithm that purports
to extract secondary structure information from local
sequence correlations. In particular, no analysis can
extract additional information from the signal (the data
processing inequality) and therefore any sequence-local
prediction of secondary structure can contain no more
information than that contained in the local
primary-secondary sequence correlations.

Prediction 

Many different algorithms have been proposed for
predicting secondary structure from local inter-sequence
correlations. Interestingly, the architecture of
the majority of these algorithms does not reflect the
organization of the intra- and inter-sequence
interactions elucidated in the preceding section. Typically,
these methods use a large primary structure window of
around 15 to 27 residues to predict the single secondary
structure element at the center of that window, and
often assume that inter-amino acid correlations are
informative. However, even nearest neighbor amino acids
on the chain are only weakly correlated, and these
correlations provide negligible information about the local
structure.

FIG. 4: The hidden Markov model defined in Eqs. 9-11 is
able to extract over 95% of the available inter-sequence
information. Here, the efficiency, $R = i_\mu^{HMM} / i_\mu$, and average
3 state accuracy (Q3) are plotted against HMM window size
($L = k + 1$) for single sequence prediction on the SCOP 1.61
40% STRIDE/CK data set. Results for the Barton data set
are similar. Window sizes cannot be reliably extended beyond
those shown here due to finite sequence data. The model
information density, $i_\mu^{HMM}$, approaches (but cannot exceed)
the inter-sequence mutual information density, $i_\mu$, indicating
that the model is almost optimal. The prediction accuracy,
Q3 = 65.9 ± 0.3% at $L = 9$, is the same (within statistical
errors) as the accuracy of a variety of comparable secondary
structure prediction algorithms, suggesting that these
algorithms are also almost optimal.

As an alternative prediction algorithm, we have
constructed a relatively simple hidden Markov model
(HMM) (Eqs. 9-11, Fig. 5) that embodies three key
approximations: that protein sequences are statistically
homogeneous, that direct secondary structure to primary
structure interactions are local along the chain, and that
amino acids at neighboring sites are independent.
Instead of a large primary structure window, we use short,
overlapping secondary structure windows. Similar
models, with similar assumptions, can be found in the work
of Thompson and Goldstein and Schmidler et al.

We estimate the amount of information that the HMM
successfully extracts by measuring the mean log odds
of the observed secondary structure fragments (Eq. 8),
and then extrapolating across different length scales to
estimate the model mutual information density, $i_\mu^{HMM}$
(see the definition of the mutual information density
above). Since the maximum amount of information
that can be extracted is the previously estimated
inter-sequence mutual information density $i_\mu(R;S)$ (Fig. 3),
we may profitably consider the efficiency ratio, $R =
i_\mu^{HMM} / i_\mu$, which is plotted in Fig. 4. This model is able
to extract over 90% of the available information with a
modest secondary structure window size of only $L = 7$.
In other words, the prediction algorithm is almost
optimal.



The most common measure of secondary structure
prediction quality is the average three state accuracy, Q3,
the average fraction of residues that are correctly
classified as helix, strand or other. Prediction accuracy
increases monotonically with window length, reaching
65.9 ± 0.3% at $L = 9$ (see Fig. 4). We cannot reliably
increase the window size further due to the finite size of
the training and test data sets.
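As a concrete reading of the Q3 definition, the following minimal sketch computes the fraction of residues whose predicted three-state class matches the observed class. The function name and the toy strings are assumptions for illustration, and pooling residues over all sequences is one possible convention for the average.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted class (E, H or L) matches the observed class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed sequences must have equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


def pooled_q3(pairs):
    """Q3 over a data set of (predicted, observed) pairs, pooling all residues."""
    correct = sum(sum(p == o for p, o in zip(pred, obs)) for pred, obs in pairs)
    total = sum(len(obs) for _, obs in pairs)
    return correct / total


# Toy example: 6 of the 8 residues are classified correctly.
example = q3("HHHLLEEE", "HHLLLEEL")
print(f"Q3 = {100 * example:.1f}%")
```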

Prediction accuracy can vary considerably due to
variations in secondary structure assignment and due to
variations in the underlying data set itself. Our standard
data set consists of 2,853 sequences derived from the 40%
subset of SCOP release 1.61, with STRIDE [1] secondary
structure assignments. We also considered prediction
accuracies for the Cuff-Barton library of 513 sequences,
using STRIDE and DSSP secondary structure
assignments, and two different reductions of the STRIDE and
DSSP alphabets to 3 states, the CK and EHL mappings
(for details, see Materials and Methods). At $L = 7$
accuracy ranges from 63.6 ± 0.6% to 66.4 ± 0.7%. The
maximum accuracy is achieved with the CK mapping,
irrespective of the secondary structure assignment
program. Essentially, the CK mapping produces more
coherent, less random secondary structure sequences than
the EHL mapping, which makes prediction easier.
Using the smaller library of only 513 sequences leads to
substantial standard errors of about 0.6%, and to a large
estimated bias of about 0.7%. (Without a bias correction
our maximum reported accuracy would be 67%.) A
number of different secondary structure prediction algorithms
have been tested upon the Barton data set. However,
given these small sample errors and the variation due
to changes in secondary structure assignment, we cannot
statistically distinguish accuracies separated by less than
about 2 points [4, 17]. Since the range of reported
accuracies is about 65%-68%, we are obliged to conclude
that many very different secondary structure prediction
algorithms are statistically indistinguishable.

Our HMM is almost optimal, in the sense that
it extracts almost all of the available information.
Moreover, the accuracy of our model is approximately the
same (within statistical and systematic errors) as the
maximum accuracy of a variety of other secondary
structure prediction methods that utilize only local
sequence-sequence correlations. This suggests that these
algorithms are also almost optimal, and that the modest
prediction accuracy is due to the fundamental lack of
local structure information. Conversely, the fact that these
diverse, sophisticated prediction algorithms are not able
to extract additional signal from local correlations
indicates that we have not overlooked some subtle source of
secondary structure information in our analysis of local
inter-sequence correlations.

It has been found that secondary structure prediction
accuracy can be substantially enhanced by basing the
prediction upon a multiple sequence alignment (MSA) of
homologous protein sequences, rather than just a single
sequence. Since protein structure tends to evolve
relatively slowly, the MSA essentially represents many
semi-independent amino acid sequences, each associated with
approximately the same secondary structure sequence.
How informative is this additional data? We extended
our HMM to handle this evolutionary information
(described in Eqs. 12-14), and tested the model on the MSAs
provided with the Barton data set. This resulted in a
three state accuracy of 72.2 ± 0.6%, an improvement of
about 6 points over the equivalent single sequence
results. Since the information ratio for this data set was
$R \approx 1.3$, this modest accuracy increase actually
represents a considerable increase in information. This
accuracy is similar to reported accuracies of a number of
other algorithms tested on this data set. It may be
that many profile based secondary structure prediction
algorithms are essentially equivalent, and that the
differing results are due, primarily, to differences in the quality
of the input alignments.



DISCUSSION 

Although local inter-sequence information is
insufficient to accurately determine secondary structure, such
correlations are still useful to statistical tertiary
structure prediction algorithms. For example, in protein
threading a primary sequence is matched to a
structural template using an amino-acid contact potential,
and other similar potentials derived from
sequence-structure correlations. Recently, the information
contained in amino acid contacts was estimated to be about
0.04 bits per contact, or 0.06 bits per residue, which
can be compared to our estimate of 0.16 bits per residue
of primary to secondary structure mutual information.
Therefore, local structure potentials may be of
under-appreciated importance to threading, and other
similar statistical structure prediction methods. Many such
methods do consider secondary structure, but some of
these only consider the direct correlation between an
amino acid and the secondary structure class at that
one residue [5, 7]. By ignoring the correlations between an
amino acid and an extended segment of local secondary
structure such methods lose over half of the available
local signal (0.081 of 0.164 bits per residue is retained), and,
unlike secondary structure prediction algorithms, are not optimal.

Protein folding is also constrained by the scarcity of
local structure information, since the mechanism by which
information is extracted, either by a computer or physics,
is irrelevant. Secondary structure must be predominantly
determined by non-local interactions, which in turn depend
on the overall, native fold of the protein. But the native
fold cannot be achieved until the native secondary
structure has formed. Therefore, protein folding must
typically proceed by a cooperative mechanism, where
secondary and tertiary structures form concurrently. Note,
however, that since this conclusion is based upon a
statistical analysis, it applies only to proteins on the average,
and does not preclude particular proteins, or parts of
proteins, from folding via a hierarchical mechanism where
pre-organized local secondary structure elements collapse
successively into ever-larger structures. For example, it
has been suggested that the B-domain of staphylococcal
protein A, a small, single domain protein, can fold
extremely quickly because of its strongly defined native
secondary structure, which persists even in the unfolded
state. If this is a general property of fast folding proteins,
then the widely divergent folding rates of single domain
proteins may be strongly correlated with the accuracy to
which a particular protein's secondary structure can be
predicted from the primary sequence.

There are at least two approaches to prediction that
aim to circumvent the lack of local structure information.
One is to utilize evolutionary information. Since protein
sequences evolve more rapidly than protein structure, a
multiple sequence alignment of a homologous family
represents many semi-independent sequence samples of
approximately the same protein structure. Local structure
prediction quality is then limited by the size of the
family, the divergence of structure across the family, and
the quality of the alignment. This strategy is commonly
employed in secondary structure prediction, and
improvements in accuracy to about Q3 ≈ 75 ± 3% are
routine [3, 15, 18, 26, 27]. By modifying our HMM to use
evolutionary profiles, we find that even a modest increase in
prediction accuracy represents a substantial increase in
secondary structure information.

The alternative approach is to explicitly incorporate
non-local interactions. This is essentially what threading
attempts to do, although the relatively small magnitude
of contact potential information suggests that the bulk of
non-local information is subtle, and difficult to extract.
Of course, in principle we can determine the full
three-dimensional structure of a protein using an atomically
detailed molecular simulation. Until this becomes routinely
feasible, computational structure determination will have
to proceed via less direct, statistical approaches.



MATERIALS AND METHODS 

Secondary Structure Library 

Ideally, a secondary structure library should be based
upon a representative, high-quality and non-redundant
subset of available protein structures. The Protein Data
Bank (PDB) [28] currently contains over 20,000
publicly accessible structures, but many of these are
very similar, and many are of relatively low quality.
The Structural Classification Of Proteins (SCOP)
database provides a convenient decomposition of PDB
structures into domains, and the ASTRAL [29, 30]
compendium provides representative subsets of SCOP
domains, filtered so that no two domains share more than a
given percentage level of sequence identity. This filtering
preferentially retains higher quality structures, as judged
by AEROSPACI scores, an agglomeration of several
structure quality measures. We selected the ASTRAL
40% sequence identity subset of SCOP release 1.61, which
was further filtered to remove multi-sequence domains,
SCOP classes f (membrane and cell surface proteins) and
g (small proteins), and retain only those structures
determined by X-ray diffraction at better than 2.5 Å
resolution. The protein sequences were taken from the
ASTRAL Rapid Access Format (RAF) sequence mappings [30],
which provide a more reliable and convenient
representation of the true sequence than the PDB ATOM or
SEQRES records. The secondary structure sequences were
determined by the program STRIDE [1], using each
protein's hydrogen bonding pattern and backbone torsional
angles. STRIDE was unable to process a small fraction
of SCOP domains, which were consequently removed
from further consideration. The resulting library
contains 2,853 protein domains and 553,373 residues.

For comparative purposes, we also studied the
secondary structure library of Cuff and Barton, which
consists of 513 proteins and 84,091 residues. Secondary
structure assignments are provided by both STRIDE and
the program DSSP. This data set also includes, for
each structure, a multiple alignment of homologous
sequences. These multiple sequence alignments were
converted to amino acid probability profiles using the
program hmmbuild from HMMER (v2.3).

Both DSSP and STRIDE assign each residue's
secondary structure to one of 8 classes: α-helix (H), $3_{10}$
helix (G), π-helix (I), β-strand (E), β-bridge (B or b), Coil
(C, L, or space), Turn (T) or Bend (S). Unstructured or
poorly resolved regions of the protein are unassigned (X).
These 8 classes were reduced to the three letter alphabet,
E (Extended strand), H (Helix), and L (Loop/Other)
using the common CK mapping [18, 33, 34]: E→E; H→H; all
others→L. We also considered another common
reduction, the "EHL" mapping: E, B→E; H, G, I→H; all
others→L.
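The two reductions can be written as simple lookup tables. The following sketch is illustrative only; the names `CK_MAP`, `EHL_MAP` and `reduce_alphabet` are hypothetical, and treating the lowercase bridge code `b` like `B` in the EHL mapping is an assumption, since the text lists only `B` explicitly.

```python
# CK mapping: only E and H keep their identity; every other class becomes L.
CK_MAP = {"E": "E", "H": "H"}

# EHL mapping: strands and bridges become E; all three helix types become H.
# Mapping lowercase "b" to E is an assumption (the text lists only "B").
EHL_MAP = {"E": "E", "B": "E", "b": "E", "H": "H", "G": "H", "I": "H"}


def reduce_alphabet(assignment, mapping):
    """Reduce an 8-class DSSP/STRIDE string to the 3-letter E/H/L alphabet."""
    return "".join(mapping.get(code, "L") for code in assignment)


# Toy 8-class assignment string (not taken from a real structure).
eight_state = "HGGEEBTSC "
ck_reduced = reduce_alphabet(eight_state, CK_MAP)
ehl_reduced = reduce_alphabet(eight_state, EHL_MAP)
print(ck_reduced)
print(ehl_reduced)
```

The CK mapping turns short $3_{10}$ helices and isolated bridges into loop, which is consistent with the remark in the Results that it yields more coherent secondary structure sequences than the EHL mapping.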



Entropy Estimation and Bias Correction 

The entropy of a discrete probability distribution can be
estimated by sampling from the distribution, and then replacing
the true probabilities, $P(x)$, by the observed frequencies,
$f(x) = n_x / N$. Here, $N$ is the total number of samples,
and $n_x$ is the number of observations of state $x$. A useful
alternative approach is to construct an approximation of
the true probabilities, $q(x) \approx P(x)$ (e.g., Eq. 11), and
then estimate the entropy by the mean log likelihood of
the data:

$$H(X) = -E[\log_2 P(X)] \le -E[\log_2 q(X)] \approx -\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i). \tag{7}$$

FIG. 5: A factor graph for $P(S|R)$, representing the
decomposition of this complex, many variable function into
simpler parts (Eqs. 9-11). Circles represent variables and squares
represent factors, local functions of relatively few variables.
The upper and lower rows of circles represent the primary
and secondary structure sequences, respectively. In this
diagram $k = 2m = 6$, for a window of size $k + 1 = 7$. One set
of factors, centered on sequence position $i$, has been
highlighted. The bottom factor connects $k + 1$ neighboring
secondary structure elements, and represents the approximation
of the secondary structure sequence probability by a $k$th order
Markov chain (Eq. 11). The factor between the chains
represents the inter-sequence dependence (Eq. 10). Thus, each
residue is directly dependent upon a window of secondary
structure (length $2m + 1$), and is conditionally independent
of neighboring residues.

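Similarly, the mutual information can be related to the
mean log odds, since

$$I(X;Y) = E\left[\log_2 \frac{P(X,Y)}{P(X)P(Y)}\right] = E\left[\log_2 \frac{P(X|Y)}{P(X)}\right]. \tag{8}$$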

A serious problem with either approach is that
entropies estimated from limited amounts of data tend to
be significantly biased, resulting in a systematic
underestimation of the true entropy, or overestimation of
the mutual information. We used non-parametric
bootstrap resampling to correct for this bias, and to
estimate standard statistical errors. Fifty replicas of the
original data are generated by sampling, with
replacement, from the available sequences. This resampling has
associated systematic and random errors that are
approximately the same as the errors introduced by the original
finite sampling of sequences from the true random
distribution. These error estimates were not significantly
improved when the number of replicas was increased from
50 to 500. The requisite pseudo-random numbers were
drawn from the Mersenne Twister generator.
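The bootstrap procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it resamples whole sequences with replacement, re-estimates the single site entropy on each replica, and uses the mean shift of the replicas to correct the original estimate. The toy EHL strings, the function names and the specific correction formula (original estimate minus estimated bias) are assumptions.

```python
import math
import random
from collections import Counter


def plugin_entropy(sequences):
    """Single site entropy (bits) from observed symbol frequencies."""
    counts = Counter(symbol for seq in sequences for symbol in seq)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def bootstrap_entropy(sequences, replicas=50, seed=0):
    """Return (bias-corrected estimate, standard error, uncorrected estimate)."""
    rng = random.Random(seed)  # Python's default generator is the Mersenne Twister
    estimate = plugin_entropy(sequences)
    boots = []
    for _ in range(replicas):
        # Resample whole sequences, with replacement, to the original library size.
        resample = [rng.choice(sequences) for _ in sequences]
        boots.append(plugin_entropy(resample))
    mean = sum(boots) / replicas
    bias = mean - estimate  # estimated systematic error of the plug-in estimator
    variance = sum((b - mean) ** 2 for b in boots) / (replicas - 1)
    return estimate - bias, math.sqrt(variance), estimate


# Toy library of secondary structure strings (invented for illustration).
toy_library = [
    "LLHHHHHHHHLLL", "LEEEEELLEEEEL", "LLHHHHLLEEELL", "LHHHHHHHHHHHL",
    "LLEEEELLLHHHH", "LLLLHHHHHLLLL", "EEEELLLEEEELL", "LHHHHLLLHHHHL",
]
corrected, std_error, uncorrected = bootstrap_entropy(toy_library)
print(f"H = {corrected:.3f} +/- {std_error:.3f} bits (uncorrected {uncorrected:.3f})")
```

Resampling sequences rather than individual residues preserves the correlations within each sequence, which is why the replicas mimic the errors of the original finite sample.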



Secondary Structure Hidden Markov Model 

The probability $P(S|R)$ of a secondary structure
sequence, $S$, given the primary sequence, $R$, can be
rewritten using Bayes' rule as

$$P(S|R) = \frac{P(R|S)\,P(S)}{P(R)}. \tag{9}$$

Since the probability of an amino acid residue depends on
the local secondary structure, and is almost independent
of the identity of neighboring residues, to a good
approximation the probability of each residue can be estimated
from a short window of local secondary structure,

$$P(R|S) \approx \prod_{i=1}^{L} P(R_i | S_{[i-m,\,i+m]}). \tag{10}$$

Here, $Z_i$ is the element at position $i$ of a sequence $Z$, and
$Z_{[i,j]}$ is a subsequence of length $j - i + 1$ starting at
position $i$ and ending at $j$. Residues beyond the termini of the
actual sequence ($i < 1$, $i > L$) are treated as undetermined. The
window size, $2m + 1$, is an adjustable parameter of the
model, and need not be particularly long, since the
inter-sequence correlations have a characteristic length scale of
only about 4 residues (see Fig. 3). We approximate the
prior probability of the secondary structure sequence by
a $k$th order Markov chain,

$$P(S) \approx P(S_{[1,k]}) \prod_{i=1}^{L-k} P(S_{i+k} | S_{[i,\,i+k-1]}). \tag{11}$$

The primary structure sequence probabilities $P(R)$ can
be determined from normalization. Combining the
preceding approximations (Eqs. 10 and 11), using $k = 2m$ for
consistency, generates a hidden Markov model
(summarized in Fig. 5) that emits the primary structure sequence
on transitions between blocks of secondary structure of
length $k$.

The probabilistic model of Eqs. 9-11 can be generalized
so that the prediction is based upon a multiple sequence
alignment (MSA) of homologous protein sequences.
First, we convert the multiple sequence alignment into
an amino acid profile, $\theta = \{\theta_1(r), \theta_2(r), \ldots, \theta_L(r)\}$, that
represents the probabilities of each amino acid at each
position of the protein of interest. The secondary
structure probability, given this profile, may then be
approximated as

$$P(S|\theta) = \frac{P(\theta|S)\,P(S)}{P(\theta)}, \tag{12}$$

$$P(\theta|S) \approx \prod_{i=1}^{L} P(\theta_i | S_{[i-m,\,i+m]}). \tag{13}$$

We expect that each residue's observed homology
profile, $\theta_i(r)$, will vary from the structure profile,
$P(r | S_{[i-m,\,i+m]})$, due to sampling errors, random
site-to-site variation, inter-protein structural variation and
because each residue is under different structural,
functional and evolutionary constraints. As a simple
approximation, we use the large deviation distribution [9] to model
the variation of the observed profile from the expected
profile:

$$P(\theta_i | \beta, S_{[i-m,\,i+m]}) \approx \exp\{-\beta\, D(\theta_i(r) \,\|\, P(r | S_{[i-m,\,i+m]}))\}. \tag{14}$$

Here, $D(p\|q) = \sum_i p_i \ln(p_i / q_i)$ is the relative entropy.
We treat $\beta$ as an empirical dispersion parameter that is
independent of the secondary structure or primary
structure profile.
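A small numerical sketch may help in reading Eq. 14. It computes the relative entropy between an observed profile column and two expected profiles, and the resulting unnormalized weights $\exp(-\beta D)$. Everything numeric here is invented for illustration: the three-letter alphabet (standing in for the 20 amino acids), the profiles, and the value of $\beta$.

```python
import math


def relative_entropy(p, q):
    """D(p||q) = sum_i p_i ln(p_i / q_i), in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def profile_weight(theta, expected, beta):
    """Unnormalized weight exp(-beta * D(theta || expected)), as in Eq. 14."""
    return math.exp(-beta * relative_entropy(theta, expected))


# Hypothetical profiles over a toy 3-letter residue alphabet.
observed_column = [0.7, 0.2, 0.1]   # theta_i(r) from an alignment column
helix_profile = [0.6, 0.3, 0.1]     # P(r | helical window), invented
strand_profile = [0.2, 0.2, 0.6]    # P(r | strand window), invented
beta = 10.0                         # dispersion parameter, chosen arbitrarily

w_helix = profile_weight(observed_column, helix_profile, beta)
w_strand = profile_weight(observed_column, strand_profile, beta)
print(f"weight given helix window:  {w_helix:.4f}")
print(f"weight given strand window: {w_strand:.6f}")
```

A column that closely matches the expected profile of a structural window receives a weight near 1, while a poorly matching window is exponentially suppressed; larger $\beta$ sharpens this discrimination.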

Computationally, the conditional secondary structure
probabilities can be derived from the amino acid sequence
using the standard forward-backward dynamic
programming algorithm. The time and memory complexities
for a naive implementation are $O(L\,3^k)$, which, despite
the exponential scaling, is feasible for moderate $k$. For
example, with $k = 7$ training on one half of our library
(2853 sequences) required 4 seconds on a modest
contemporary PC (667 MHz PowerPC G4), and prediction
of the other half required approximately 5 minutes, or
about 5 sequences per second. In principle, a more
efficient implementation is possible, since, although the
total number of secondary structure sequences scales as
$3^L$, the number of typical sequences with non-negligible
probability scales as $2^{L h_\mu(S)} \approx 1.5^L$, by the asymptotic
equipartition principle. The optimal prediction at a
particular site is the secondary structure element with the
greatest posterior probability.
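The forward-backward computation can be illustrated with a deliberately simplified model. The sketch below is not the model of Eqs. 9-11: it uses a first-order chain over single E/H/L states and lets each residue depend only on the state at its own site, with invented parameters over a toy three-letter residue alphabet. It shows how per-site posterior probabilities are obtained and checks them against brute-force enumeration of all structures.

```python
import itertools

STATES = "EHL"
# All parameters below are hypothetical and for illustration only.
INIT = {"E": 0.2, "H": 0.4, "L": 0.4}
TRANS = {
    "E": {"E": 0.80, "H": 0.02, "L": 0.18},
    "H": {"E": 0.02, "H": 0.90, "L": 0.08},
    "L": {"E": 0.12, "H": 0.10, "L": 0.78},
}
EMIT = {  # P(residue | state) over a toy residue alphabet {A, G, V}
    "E": {"A": 0.2, "G": 0.2, "V": 0.6},
    "H": {"A": 0.6, "G": 0.1, "V": 0.3},
    "L": {"A": 0.2, "G": 0.6, "V": 0.2},
}


def posteriors(residues):
    """Per-site P(S_i | R) by the forward-backward algorithm (unscaled)."""
    fwd = [{s: INIT[s] * EMIT[s][residues[0]] for s in STATES}]
    for r in residues[1:]:
        prev = fwd[-1]
        fwd.append({t: EMIT[t][r] * sum(prev[s] * TRANS[s][t] for s in STATES)
                    for t in STATES})
    bwd = [{s: 1.0 for s in STATES}]
    for r in reversed(residues[1:]):
        nxt = bwd[0]
        bwd.insert(0, {s: sum(TRANS[s][t] * EMIT[t][r] * nxt[t] for t in STATES)
                       for s in STATES})
    evidence = sum(fwd[-1].values())  # P(R), the normalization of Eq. 9
    return [{s: fwd[i][s] * bwd[i][s] / evidence for s in STATES}
            for i in range(len(residues))]


def brute_force_posteriors(residues):
    """Same posteriors by summing over all 3^L structures (tiny L only)."""
    n = len(residues)
    post = [{s: 0.0 for s in STATES} for _ in range(n)]
    total = 0.0
    for path in itertools.product(STATES, repeat=n):
        p = INIT[path[0]] * EMIT[path[0]][residues[0]]
        for i in range(1, n):
            p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][residues[i]]
        total += p
        for i, s in enumerate(path):
            post[i][s] += p
    return [{s: v / total for s, v in site.items()} for site in post]


def predict(residues):
    """Choose, at each site, the state with the greatest posterior probability."""
    return "".join(max(site, key=site.get) for site in posteriors(residues))


toy_residues = "AAVVGGA"
fast = posteriors(toy_residues)
slow = brute_force_posteriors(toy_residues)
print(predict(toy_residues))
```

For realistic sequence lengths the forward and backward quantities would need rescaling or log-space arithmetic to avoid underflow, and the state space would be the $3^k$ blocks of secondary structure rather than three single states.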

The available sequence data was partitioned, by assigning
alternate sequences, into disjoint test and training sets of
approximately equal size. The training set was used to
estimate secondary structure block probabilities, $P(S_{[i,\,i+k]})$
(regularized with a Laplace pseudocount of 1), and
corresponding amino acid profiles, $P(R_i | S_{[i-m,\,i+m]})$
(regularized with a pseudocount of 20 times the amino acid
background probability). Statistical errors were estimated
from a full bootstrap resampling of both the test and
training sequences.



Availability

Both the data sets and second-hmm, the program
developed for this analysis, are freely available from our
web site at http://compbio.berkeley.edu/